Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


window passes a threshold, then that window will be identified as an active region. A measure such as entropy may be used to quantify the activity in the region.
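
The thresholding idea above can be illustrated with a minimal Python sketch. It assumes the activity score of a position is the Shannon entropy of the bases piled up there, and that a window is active when its mean score exceeds a cutoff. The function names, the window size, the threshold, and the toy pileup are all hypothetical; this is not how any particular variant caller computes its activity profile.

```python
import math
from collections import Counter


def column_entropy(bases):
    """Shannon entropy (in bits) of the bases observed at one pileup position."""
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def active_windows(pileup, window=3, threshold=0.5):
    """Return (start, end) pairs of windows whose mean entropy exceeds the threshold.

    `window` and `threshold` are illustrative values, not defaults of a real tool.
    """
    scores = [column_entropy(column) for column in pileup]
    hits = []
    for start in range(len(scores) - window + 1):
        mean_score = sum(scores[start:start + window]) / window
        if mean_score > threshold:
            hits.append((start, start + window))  # end index is exclusive
    return hits


# Toy pileup: one string of observed bases per reference position (made-up data).
pileup = ["AAAAAAAA", "CCCCCCCC", "GGGGTTTT", "AAAAAACC", "TTTTTTTT", "GGGGGGGG"]
print(active_windows(pileup))
```

Positions where all reads agree score zero, so only windows that overlap the mixed columns are flagged. A real caller would merge overlapping windows and pad them before reassembly.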

After the active regions are identified, the haplotypes are constructed by reassembling the reads in each region. A de Bruijn-like graph is used to reassemble the active region and to identify the possible haplotypes present in the alignments. Once the haplotypes are determined, the original alignment of the reads is discarded and the candidate haplotypes are realigned to the haplotype of the reference genome using Smith-Waterman local alignment. Pairwise alignment of the reads against the haplotypes is also performed using a Pairwise Hidden Markov Model (PairHMM), which generates a matrix of likelihoods of the haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of the alleles at each potentially variant site given the read data. The genotype, that is, the most likely pair of alleles, is then determined for each position. For a given genotype ($G_i$) and a subset of overlapping reads ($R_i$), the variant caller uses Bayesian statistics to evaluate the posterior probability of the hypothesized genotype ($G_i$) as follows:

$$P(G_i \mid R_i) = \frac{P(R_i \mid G_i)\,P(G_i)}{P(R_i)} \tag{4.1}$$

where the posterior probability $P(G_i \mid R_i)$ is the probability of the genotype ($G_i$) given the subset of reads ($R_i$), $P(G_i)$ is the prior probability with which we expect to observe the genotype based on previous observations, $P(R_i)$ is the probability of observing the subset of reads (the probability of the evidence), and $P(R_i \mid G_i)$ is the probability of the reads given the genotype. The Bayesian variant caller writes the above formula as:

$$P(G_i \mid R_i) = \frac{P(R_i \mid G_i)\,P(G_i)}{\sum_{k} P(R_i \mid G_k)\,P(G_k)} \tag{4.2}$$

We can ignore the denominator because it is the same for all genotypes. Thus

$$P(G_i \mid R_i) \propto P(R_i \mid G_i)\,P(G_i) \tag{4.3}$$
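
As a small numerical illustration of the Bayesian step above, the following Python sketch multiplies each genotype likelihood by its prior and divides by the sum over all genotypes. The function name and the likelihood values are hypothetical, and the uniform prior used when none is supplied is an assumption of this sketch rather than the behavior of a specific tool.

```python
def genotype_posteriors(likelihoods, priors=None):
    """Posterior P(G | R) for each genotype from P(R | G) and P(G).

    `likelihoods` maps genotype -> P(R | G). If `priors` is omitted,
    a uniform prior over the listed genotypes is assumed (sketch only).
    """
    genotypes = list(likelihoods)
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    # Numerator of Bayes' rule for every candidate genotype.
    unnormalized = {g: likelihoods[g] * priors[g] for g in genotypes}
    # Denominator: the same sum for every genotype, so it cannot change the ranking.
    evidence = sum(unnormalized.values())
    return {g: unnormalized[g] / evidence for g in genotypes}


# Made-up read likelihoods for three diploid genotypes at one site.
likelihoods = {"AA": 1e-6, "AG": 4e-4, "GG": 1e-5}
posteriors = genotype_posteriors(likelihoods)
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

Because the denominator is shared by all genotypes, ranking by the unnormalized products gives the same best genotype. Normalizing is only needed when the posterior values themselves are reported.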

The variant callers use a flat prior probability by default, which users can change if the genotype probabilities are known from previous observations. The important term in the above formula is $P(R_i \mid G_i)$, which can also be described as the likelihood of the genotype hypothesis ($G_i$) given the reads ($R_i$):

$$P(R_i \mid G_i) = L(G_i \mid R_i) = \prod_{j} \left( \frac{L(R_j \mid H_1)}{2} + \frac{L(R_j \mid H_2)}{2} \right) \tag{4.4}$$

where $L(R_j \mid H_1)$ and $L(R_j \mid H_2)$ are the haplotype likelihoods.
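
One way to read the genotype likelihood above is as a product over reads, where each read contributes the average of its likelihoods under the two haplotypes of a diploid genotype. The Python sketch below follows that reading and works in log10 space to avoid numerical underflow. The function name, the haplotype labels `ref` and `alt`, and the per-read values are hypothetical; in practice the per-read haplotype likelihoods would come from the PairHMM step.

```python
import math
from itertools import combinations_with_replacement


def genotype_log_likelihood(read_hap_likelihoods, h1, h2):
    """log10 of the product over reads of (L(read | h1) / 2 + L(read | h2) / 2).

    `read_hap_likelihoods` is a list with one dict per read,
    mapping haplotype name -> likelihood of that read under the haplotype.
    """
    return sum(
        math.log10(0.5 * per_read[h1] + 0.5 * per_read[h2])
        for per_read in read_hap_likelihoods
    )


# Made-up likelihoods for four reads: two support "ref", two support "alt".
reads = [
    {"ref": 0.9, "alt": 0.001},
    {"ref": 0.001, "alt": 0.8},
    {"ref": 0.85, "alt": 0.002},
    {"ref": 0.002, "alt": 0.9},
]

# Score every unordered pair of haplotypes, i.e. every diploid genotype.
scores = {
    (h1, h2): genotype_log_likelihood(reads, h1, h2)
    for h1, h2 in combinations_with_replacement(["ref", "alt"], 2)
}
best = max(scores, key=scores.get)
print(best)
```

With reads split evenly between the two haplotypes, the heterozygous pair scores far higher than either homozygous pair, because each homozygous genotype must explain half of the reads as errors.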

The likelihoods of all possible genotypes are calculated based on the alleles that were

observed at the site, considering every possible combination of alleles. Then, the most likely